Quality Assessment of Linked Datasets using Probabilistic Approximation
With the increasing application of Linked Open Data, assessing the quality of
datasets by computing quality metrics becomes an issue of crucial importance.
For large and evolving datasets, an exact, deterministic computation of the
quality metrics is too time consuming or expensive. We employ probabilistic
techniques such as Reservoir Sampling, Bloom Filters and Clustering Coefficient
estimation for implementing a broad set of data quality metrics in an
approximate but sufficiently accurate way. Our implementation is integrated in
the comprehensive data quality assessment framework Luzzu. We evaluated its
performance and accuracy on Linked Open Datasets of broad relevance.
Comment: 15 pages, 2 figures. To appear in ESWC 2015 proceedings.
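To illustrate the flavor of these techniques, here is a minimal sketch of reservoir sampling in Python (my own illustration, not the Luzzu implementation): a metric evaluated over the reservoir approximates the metric over the full, possibly huge, dataset.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Algorithm R: keep a uniform random sample of k items from a stream
    of unknown length using O(k) memory; a quality metric computed on the
    reservoir then approximates the metric over the full dataset."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)   # fill the reservoir first
        else:
            j = rng.randint(0, i)    # replace a slot with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), k=1000)
```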
Cache-oblivious index for approximate string matching
This paper revisits the problem of indexing a text for approximate string matching. Specifically, given a text T of length n and a positive integer k, we want to construct an index of T such that for any input pattern P, we can find all its k-error matches in T efficiently. This problem is well-studied in the internal-memory setting. Here, we extend some of these recent results to external-memory solutions, which are also cache-oblivious. Our first index occupies O((n log^k n)/B) disk pages and finds all k-error matches with O((|P| + occ)/B + log^k n · log log_B n) I/Os, where B denotes the number of words in a disk page. To the best of our knowledge, this index is the first external-memory data structure that does not require Ω(|P| + occ + poly(log n)) I/Os. The second index reduces the space to O((n log n)/B) disk pages, and the I/O complexity is O((|P| + occ)/B + log^{k(k+1)} n · log log n).
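As a point of reference for the problem being indexed, the following sketch (an assumed baseline, not the paper's data structure) finds all k-error matches by plain dynamic programming; the indexes above exist to avoid this O(|P| · n) scan:

```python
def k_error_matches(text, pattern, k):
    """Sellers' dynamic programming over edit distance: report every end
    position j such that some substring of text ending at j is within
    edit distance k of pattern. O(|pattern| * |text|) time."""
    m = len(pattern)
    prev = list(range(m + 1))   # column before any text is read: D[i] = i
    matches = []
    for j, c in enumerate(text):
        curr = [0] * (m + 1)    # row 0 stays 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            curr[i] = min(prev[i - 1] + cost,  # match / substitution
                          prev[i] + 1,         # extra character in the text
                          curr[i - 1] + 1)     # pattern character unmatched
        if curr[m] <= k:
            matches.append(j)   # 0-based end position of a k-error match
        prev = curr
    return matches

# end positions of substrings within edit distance 2 of "celery"
print(k_error_matches("accelerate", "celery", 2))
```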
A Bulk-Parallel Priority Queue in External Memory with STXXL
We propose the design and an implementation of a bulk-parallel external
memory priority queue to take advantage of both shared-memory parallelism and
high external memory transfer speeds to parallel disks. To achieve higher
performance by decoupling item insertions and extractions, we offer two
parallelization interfaces: one using "bulk" sequences, the other by defining
"limit" items. In the design, we discuss how to parallelize insertions using
multiple heaps, and how to calculate a dynamic prediction sequence to prefetch
blocks and apply parallel multiway merge for extraction. Our experimental
results show that in the selected benchmarks the priority queue reaches 75% of
the full parallel I/O bandwidth of rotational disks and 65% of SSDs, or the
speed of sorting in external memory when bounded by computation.
Comment: extended version of the SEA'15 conference paper.
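The following toy sketch (a Python model of the idea, not the actual C++/STXXL implementation) illustrates the decoupling described above: bulk insertions are spread over several independent heaps, and extraction performs a multiway merge over their heads.

```python
import heapq

class BulkPriorityQueue:
    """Toy model of the bulk-parallel design: items go into several
    independent insertion heaps (in the real STXXL queue, one per thread),
    and extraction lazily merges across all heaps."""
    def __init__(self, num_heaps=4):
        self.heaps = [[] for _ in range(num_heaps)]

    def bulk_push(self, items):
        # Distribute a bulk of insertions round-robin over the heaps;
        # with real threads each heap would be filled concurrently.
        for i, item in enumerate(items):
            heapq.heappush(self.heaps[i % len(self.heaps)], item)

    def pop_min(self):
        # Multiway-merge step: take the smallest head among all heaps.
        best = min((h for h in self.heaps if h), key=lambda h: h[0], default=None)
        if best is None:
            raise IndexError("pop from empty queue")
        return heapq.heappop(best)

pq = BulkPriorityQueue()
pq.bulk_push([5, 1, 4, 1, 5, 9, 2, 6])
print([pq.pop_min() for _ in range(4)])  # -> [1, 1, 2, 4]
```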
Low Space External Memory Construction of the Succinct Permuted Longest Common Prefix Array
The longest common prefix (LCP) array is a versatile auxiliary data structure
in indexed string matching. It can be used to speed up searching using the
suffix array (SA) and provides an implicit representation of the topology of an
underlying suffix tree. The LCP array of a string of length n can be
represented as an array of n words, or, in the presence of the SA, as
a bit vector of 2n bits plus asymptotically negligible support data
structures. External memory construction algorithms for the LCP array have been
proposed, but those proposed so far have a space requirement of O(n) words
(i.e. O(n log n) bits) in external memory. This space requirement is in some
practical cases prohibitively expensive. We present an external memory
algorithm for constructing the 2n bit version of the LCP array which uses
O(n log σ) bits of additional space in external memory when given a
(compressed) BWT with alphabet size σ and a sampled inverse suffix array
at sampling rate O(log n). This is often a significant space gain in
practice where σ is usually much smaller than n or even constant. We
also consider the case of computing succinct LCP arrays for circular strings.
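To make the 2n-bit representation concrete, here is a small sketch (my own illustration, not the paper's construction algorithm) of the standard succinct encoding of the permuted LCP (PLCP) array: because PLCP[i] >= PLCP[i-1] - 1, the deltas PLCP[i] - PLCP[i-1] + 1 are non-negative, and unary-coding them yields a bit vector of roughly 2n bits from which each value can be recovered by a select query.

```python
def plcp_to_bits(plcp):
    """Unary-code the deltas PLCP[i] - PLCP[i-1] + 1 (each >= 0, summing
    to at most about n), giving a bit vector of roughly 2n bits."""
    bits, prev = [], 0
    for v in plcp:
        bits.extend([0] * (v - prev + 1))  # delta + 1 zeros ...
        bits.append(1)                     # ... then a terminator bit
        prev = v
    return bits

def plcp_from_bits(bits, i):
    """Recover PLCP[i] = select1(i+1) - 2(i+1); a real implementation would
    use an asymptotically negligible select structure, not a linear scan."""
    ones = 0
    for pos, b in enumerate(bits, start=1):
        ones += b
        if ones == i + 1:
            return pos - 2 * (i + 1)

bits = plcp_to_bits([0, 2, 1])             # -> [0, 1, 0, 0, 0, 1, 1]
print([plcp_from_bits(bits, i) for i in range(3)])  # -> [0, 2, 1]
```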
Fading histograms in detecting distribution and concept changes
The remarkable number of real applications under dynamic scenarios is
driving a novel ability to generate and gather information. Nowadays, a
massive amount of information is generated at a high-speed rate, known as
data streams. Moreover, data are collected under evolving environments.
Due to memory restrictions, data must be promptly processed and discarded
immediately. Therefore, dealing with evolving data streams raises two main
questions: (i) how to remember discarded data? and (ii) how to forget
outdated data? To maintain an updated representation of the time-evolving
data, this paper proposes fading histograms. To cope with the dynamics of
nature, changes in data are detected through a windowing scheme that
compares data distributions computed from the fading histograms: the
adaptive cumulative windows model (ACWM). The online monitoring of the
distance between data distributions is evaluated using a dissimilarity
measure based on the asymmetry of the Kullback–Leibler divergence. The
experimental results support the ability of fading histograms to provide
an updated representation of data. This property works in favor of
detecting distribution changes with a smaller detection delay when
compared with standard histograms. With respect to the detection of
concept changes, the ACWM is compared with three known algorithms from the
literature, on artificial data and on public data sets, presenting better
results. Furthermore, the proposed method was extended to multidimensional
data, and the experiments performed show the ability of the ACWM to detect
distribution changes in these settings.
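A minimal sketch of the idea follows (the fading factor alpha and the bin layout are assumptions for illustration; the paper's dissimilarity exploits the asymmetry of the KL divergence, for which a symmetrised KL is used here as a simple stand-in):

```python
import math

class FadingHistogram:
    """Histogram over fixed bins whose counts decay by a fading factor
    alpha (0 < alpha < 1) at every update, so old data is gradually
    forgotten instead of being dropped abruptly."""
    def __init__(self, edges, alpha=0.997):
        self.edges, self.alpha = sorted(edges), alpha
        self.counts = [0.0] * (len(edges) + 1)

    def update(self, x):
        self.counts = [c * self.alpha for c in self.counts]  # fade old data
        i = sum(e <= x for e in self.edges)                  # bin index of x
        self.counts[i] += 1.0

    def probs(self, eps=1e-9):
        # Smoothed normalized counts, so KL terms stay finite.
        total = sum(self.counts) + eps * len(self.counts)
        return [(c + eps) / total for c in self.counts]

def symmetric_kl(p, q):
    """Symmetrised Kullback-Leibler divergence between two histograms."""
    kl = lambda a, b: sum(x * math.log(x / y) for x, y in zip(a, b))
    return 0.5 * (kl(p, q) + kl(q, p))
```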
Refinement type contracts for verification of scientific investigative software
Our scientific knowledge is increasingly built on software output. User code
which defines data analysis pipelines and computational models is essential for
research in the natural and social sciences, but little is known about how to
ensure its correctness. The structure of this code and the development process
used to build it limit the utility of traditional testing methodology. Formal
methods for software verification have seen great success in ensuring code
correctness but generally require more specialized training, development time,
and funding than is available in the natural and social sciences. Here, we
present a Python library which uses lightweight formal methods to provide
correctness guarantees without the need for specialized knowledge or
substantial time investment. Our package provides runtime verification of
function entry and exit condition contracts using refinement types. It allows
checking hyperproperties within contracts and offers automated test case
generation to supplement online checking. We co-developed our tool with a
medium-sized (3000 LOC) software package which simulates
decision-making in cognitive neuroscience. In addition to helping us locate
trivial bugs earlier on in the development cycle, our tool was able to locate
four bugs which may have been difficult to find using traditional testing
methods. It was also able to find bugs in user code which did not contain
contracts or refinement type annotations. This demonstrates how formal methods
can be used to verify the correctness of scientific software which is difficult
to test with mainstream approaches.
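The sketch below shows roughly what runtime-checked entry/exit contracts can look like in Python; the decorator name and predicate style are invented for illustration and are not the library's actual API.

```python
import functools

def contract(pre=None, post=None):
    """Check refinement-style conditions at runtime: `pre` receives the
    call's keyword arguments, `post` receives the returned value."""
    def decorate(f):
        @functools.wraps(f)
        def wrapper(**kwargs):
            if pre is not None:
                assert pre(**kwargs), f"entry contract of {f.__name__} violated"
            result = f(**kwargs)
            if post is not None:
                assert post(result), f"exit contract of {f.__name__} violated"
            return result
        return wrapper
    return decorate

# e.g. a "probability" is a float refined by 0 <= p <= 1
@contract(pre=lambda p, n: 0.0 <= p <= 1.0 and n > 0,
          post=lambda r: r >= 0.0)
def expected_successes(p, n):
    return p * n

expected_successes(p=0.3, n=10)   # passes; p=1.5 would fail at entry
```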
Hybrid Statistical Estimation of Mutual Information for Quantifying Information Flow
Analysis of a probabilistic system often requires learning the joint probability distribution of its random variables. Computing the exact distribution usually amounts to an exhaustive, precise analysis of all executions of the system. To avoid the high computational cost of such an exhaustive search, statistical analysis has been studied to efficiently obtain approximate estimates by analyzing only a small but representative subset of the system's behavior. In this paper we propose a hybrid statistical estimation method that combines precise and statistical analyses to estimate mutual information and its confidence interval. We show how to combine analyses of different components of the system, carried out at different levels of precision, to obtain an estimate for the whole system. The new method performs weighted statistical analysis with different sample sizes over different components and dynamically finds their optimal sample sizes. Moreover, it can reduce sample sizes by using prior knowledge about systems and a new abstraction-then-sampling technique based on qualitative analysis. We show that the new method outperforms the state of the art in quantifying information leakage.
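As a sketch of the statistical half of such a hybrid (illustrative only, not the paper's estimator), the following computes a plug-in estimate of mutual information from sampled executions; a hybrid method would replace sampling with exact probabilities on components that admit cheap precise analysis.

```python
import math
from collections import Counter

def estimate_mutual_information(samples):
    """Plug-in estimate of I(X;Y) in bits from (x, y) pairs obtained by
    running the system, using the empirical joint distribution."""
    n = len(samples)
    joint = Counter(samples)                 # counts of (x, y) pairs
    px = Counter(x for x, _ in samples)      # marginal counts of x
    py = Counter(y for _, y in samples)      # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p_xy / (p_x * p_y) == c * n / (px[x] * py[y])
        mi += p_xy * math.log2(c * n / (px[x] * py[y]))
    return mi

# e.g. a channel observed through sampled (secret, output) executions
print(estimate_mutual_information([(0, 0), (0, 0), (1, 1), (1, 0)]))
```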